Abstract

In this paper we show that by carefully making good choices for
various detailed but important factors in a visual recognition
framework using deep learning features, one can achieve a simple,
efficient, yet highly accurate image classification system. We first
list 5 important factors, based on both existing research and ideas
proposed in this paper. These important detailed factors include:
1) ℓ2 matrix normalization is more effective than unnormalized or ℓ2
vector normalization, 2) the proposed natural deep spatial pyramid is
very effective, and 3) a very small K in Fisher Vectors surprisingly
achieves higher accuracy than normally used large K values. Along with
other choices (convolutional activations and multiple scales), the
proposed DSP framework is not only intuitive and efficient, but also
achieves excellent classification accuracy on many benchmark datasets.
For example, DSP's accuracy on SUN397 is 59.78%, significantly higher
than the previous state-of-the-art (53.86%).

1. Introduction

Feature representation is among the most important topics (if not the
most important one) in current state-of-the-art visual recognition
tasks. Over the past decade, handcrafted features (e.g., SIFT and HOG)
were very popular, and they were often encoded into a high dimensional
vector by the Bag-of-Visual-Words (BOVW) framework [18]. The BOVW
representation was further improved by the Vector of Locally
Aggregated Descriptors (VLAD) [10] and Fisher Vector (FV) [14]
methods, which add higher order statistics. However, such handcrafted
features are significantly outperformed in visual recognition by the
recent deep features from convolutional neural networks (CNNs).

In spite of the impressive results achieved by deep features, many
factors and details can have a huge impact on the recognition accuracy
of CNN features. Those factors include, for example, how the deep net
is trained. Zhou et al. [30] evaluated the performance of deep
features from the same network architecture learned from different
training sets (i.e., ImageNet and Places data). They achieved high
classification performance on scene recognition tasks with the
Places-CNN feature. Chatfield et al. [1] studied other factors,
including architectures of deep nets and data augmentation.

After a deep net has been successfully trained, more factors and
decisions are awaiting. In other words, how shall we use the deep
features for image recognition? Studies have been carried out very
recently, and some important details have been worked on. However, a
systematic study of "what factors are out there?" and "what choices
should be made?" is missing. In this paper, we present our studies of
these questions. Specifically, suppose we are given a pre-trained deep
CNN model,

• What are important factors in utilizing this model?
Based on existing studies in the literature and our new
proposals, we make a list of five important factors.

• What decisions are the best concerning these factors?
We carefully evaluate different choices and present our
answers to this question. Some choices (e.g., the choice
of K in FV) are quite different from previous practices
in the community.

• What effects do these factors have? We show that they
are key to high recognition accuracy. By combining the
best choices for the 5 factors we raised, we propose
Deep Spatial Pyramid (DSP), a framework that properly
utilizes deep CNN features. DSP has the following
properties:

* High accuracy. DSP improves the best reported accuracy
on many benchmark datasets in our evaluation. For example,
it raises the accuracy on SUN397 from 53.86% [30]
to 59.78%, and on Caltech-101 from 93.42% [9] to
95.11%. Note that these previous state-of-the-art
results are also based on CNNs.

* High efficiency and flexibility. DSP achieves high
processing speed, taking roughly 150 ms to process
an image. DSP also processes images of any aspect
ratio or resolution.

* Small storage cost. The final DSP representation
is memory-efficient, with around 12k dimensions.
This is much shorter than existing combinations
of CNN features and FV / VLAD, and is advantageous
in large-scale problems.

We will first present the framework, preliminaries, and the list of
important factors in Sec. 2. The study of the best decisions for these
factors is presented in Sec. 3. The study of K, however, is special
enough to have its own Sec. 4. DSP is evaluated as a whole system in
Sec. 5, where it is compared with state-of-the-art visual recognition
methods. Sec. 6 concludes this paper.

2. The framework and important factors 

Our study follows the framework illustrated in Fig. 1. In the first
step, we feed an input image with arbitrary resolution into a
pre-trained CNN model to extract deep activations. Then, a visual
dictionary with K dictionary items is trained on the deep descriptors
from training images. The third step overlays a spatial pyramid on the
deep activations of an image, partitioning them into m blocks in N
pyramid levels. Each spatial block is represented as a vector by using
the improved Fisher Vector. Thus, m blocks correspond to m FVs. In the
fourth and final step, we concatenate the m FVs to form a
2mdK-dimensional feature vector as the final image-level
representation.

Our framework does not consider how the pre-trained CNN is obtained or
how an image is classified after its representation is obtained. These
can be viewed as preliminary factors, and we follow the commonly used
decisions for them in the literature.

In practice, some CNN models (e.g., Krizhevsky et al. [11] and Zeiler
and Fergus [28]) are popularly used as the deep feature extractor in
image related tasks. However, recent networks that are even deeper
than these have been shown to further improve CNN performance. They
are characterized by deeper and wider architectures and smaller
convolutional filters when compared to traditional CNNs such as
[11, 28]. Examples of deeper nets include GoogLeNet [19] and VGG Net-D
[17]. Our work is based on the network architecture released by [17]
(i.e., VGG Net-D). This network consists of 13 layers of 3 x 3
convolutional kernels, with 5 max-pooling layers interspersed, and is
concluded by 3 fully connected layers. The width of this network
starts from 64 in the first layer and increases by a factor of 2 after
each max-pooling layer, until it reaches 512. For classification, we
use a linear SVM classifier.
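As a quick check of the layer counts quoted above, the convolutional part of VGG Net-D can be written down as a flat configuration list. This is only an illustrative sketch: the per-block layer counts (2, 2, 3, 3, 3) follow the publicly known configuration D of [17], and the names `VGG_D_CONV` and `summarize` are hypothetical helpers, not part of any library.

```python
# Convolutional part of VGG Net-D: integers are the output widths of
# 3x3 convolutions, "M" marks a 2x2 max-pooling layer.
VGG_D_CONV = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
              512, 512, 512, "M", 512, 512, 512, "M"]


def summarize(cfg):
    """Count conv and pooling layers and list the distinct widths in order."""
    n_conv = sum(1 for v in cfg if v != "M")
    n_pool = sum(1 for v in cfg if v == "M")
    widths = []
    for v in cfg:
        if v != "M" and v not in widths:
            widths.append(v)
    return n_conv, n_pool, widths


print(summarize(VGG_D_CONV))  # (13, 5, [64, 128, 256, 512])
```

The three fully connected layers are deliberately left out of the list, since DSP only uses the convolutional part.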

In the rest of this paper, we follow the notations in [6]. We use the
term "feature map" to indicate the convolutional results (after
applying the max-pooling) of one filter, the term "activations" to
indicate feature maps of all filters in a convolutional layer, and the
term "descriptor" to indicate the d-dimensional component vector of
activations. "pool5" refers to the activations of the max-pooled last
convolutional layer, and "fc8" refers to the activations of the last
fully connected layer.

With these preliminaries and notations, we now discuss 
the important factors inside this framework. 

1 Which activation to use? Deep features for an image
can be extracted from either the convolutional layers or
the fully connected layers of a pre-trained CNN. The
original idea is to use the last fully connected layer
directly for classification [11]. Recently, activations
from the convolutional layers have demonstrated their
value [28, 13, 2, 25]. Which one shall we adopt?

2 How to normalize the deep features before feeding them
into a classifier or the next level of processing? It is not
yet a common practice to normalize CNN activations.
What are the viable choices and which one is the best?

3 How many components in the FV representation? The
GMM in FV consists of K Gaussian components. It is
known that in general a large K (e.g., 256) leads to high
accuracy for fully connected activations [7, 27], dense
SIFT [14] and action features [23]. However, a large K
leads to a very long representation (hundreds of
thousands of dimensions). Is a large K really necessary?

4 Shall we capture spatial information (and how)? A
general CNN requires a fixed input image size. He et
al. [9] proposed SPP-Net, which uses Spatial Pyramid
Pooling (SPP) to remove the fixed-size constraint.
SPP-Net pools the deep activations of the last
convolutional layer to generate fixed-length outputs,
which are then fed into the fully connected layers. Is
there a simpler and more natural way to capture spatial
information?

5 Shall we use information from multiple scales? Yoo et
al. [27] replace the fully connected layers with
equivalent convolutional layers to obtain a large amount
of dense deep descriptors. Then, all the activations are
merged into a single vector by Multi-scale Pyramid
Pooling (MPP). MPP utilizes multi-scale CNN activations.
MPP, however, is computationally expensive. Is there an
efficient way to capture information from multiple scales?

These factors may seem too detailed to be important. However, existing
methods adopted very different decisions on these questions, and these
differences may well explain their performance differences. We
summarize these differences in Table 1.

Figure 1. The image classification framework. DSP feeds an arbitrary
resolution input image into a pre-trained CNN model to extract deep
activations. A GMM visual dictionary is trained on the deep
descriptors from training images. Then, a spatial pyramid partitions
the deep activations of an image into m blocks in N pyramid levels.
Each block's activations are represented as a single vector by the
improved Fisher Vector. Finally, we concatenate the m single vectors
to form a 2mdK-dimensional feature vector as the final image
representation.

Table 1. Summary of decisions in related methods

| Methods | DF | Resolution | Norm | PCA | K          | SP | Ms |
|---------|----|------------|------|-----|------------|----|----|
| SPP-net | …  | …          | …    | …   | …          | …  | …  |
| MOP     | …  | …          | ✗    | …   | …          | ✗  | ✓  |
| MPP     | C  | …          | ✗    | ✓   | 256        | ✗  | ✓  |
| D-CNN   | C  | any        | ✗    | ✗   | 64         | ✗  | ✓  |
| DSP     | C  | any        | ✓    | ✗   | 1, 2, 3, 4 | ✓  | …  |

In Table 1, "DF" refers to deep features, where "F" and "C" represent
the fully connected and convolutional layer, respectively. "Norm"
refers to how the deep activations are normalized; "K" indicates the
number of visual words or Gaussian components; "SP" refers to spatial
pyramid; "Ms" refers to multiple scales. In addition, "-" means that a
method does not involve the corresponding factor. Some methods also
use PCA to reduce the dimensionality of deep activations.

From Table 1, it is clear that the proposed DSP is flexible (accepting
images of any size), efficient (fully convolutional and very small K),
and makes full use of the image (spatial pyramid and multiple scales).
We will explain how these decisions and choices are made in the next
section.

3. Factors, choices and decisions 

We study the 5 factors in this section, in Sec. 3.1-3.4. The effect of
K, however, is studied separately in Sec. 4.

3.1. Convolutional vs. fully connected layer 

Convolutional neural networks consist of alternately stacked
convolutional layers and pooling layers, followed by one or more fully
connected layers. The convolutional layers generate feature maps by
linear convolutional filters followed by nonlinear activation
functions such as rectified linear units; the pooling layers then
max-pool the feature maps within local neighborhoods. Finally, the
activations of the last convolutional layer are fed into fully
connected layers, followed by a soft-max classifier.



Figure 2. Visualization of the feature maps. (a) An image from the
PASCAL VOC2007 dataset; (b) the 194th feature map and (c) the 207th
feature map of the input image.


However, the feature maps of top convolutional layers are known to
contain mid- and high-level information, e.g., object parts or
complete objects [29]. In Fig. 2, we visualize feature maps of the
input image that are generated by the last convolutional layer. In
this figure, the strongest responses of the 194th and 207th feature
maps correspond to the person and the motorcycle in the input image,
respectively. Thus, one major difference between convolutional and
fully connected layer activations is that the former are directly
embedded with rich semantic information of image patches, while the
latter are not necessarily so.

Furthermore, the fully connected layers require a fixed image size
(e.g., 224 x 224). In contrast, convolutional layers accept input
images of arbitrary resolution or aspect ratio. The pool5 activations
can be formulated as an order-3 tensor of size h x w x d, which
includes h x w cells, each containing one d-dimensional deep
descriptor. For example, we will get 7 x 7 x 512 activations if the
input image size is 224 x 224. Convolutional layer deep descriptors
have been successfully used in [13, 2, 25].
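The relation between the input size and the h x w x d tensor can be sketched as follows. This is a minimal illustration, assuming five 2 x 2 max-poolings with stride 2 (total stride 32), size-preserving 3 x 3 convolutions, and floor rounding; `pool5_shape` and `to_descriptors` are hypothetical helper names.

```python
import numpy as np


def pool5_shape(height, width, d=512, stride=32):
    """Size of the pool5 tensor for a given input size (assumed total stride 32)."""
    return height // stride, width // stride, d


def to_descriptors(activations):
    """Reshape an (h, w, d) tensor into a (d, h*w) matrix whose columns
    are the d-dimensional deep descriptors of the individual cells."""
    h, w, d = activations.shape
    return activations.reshape(h * w, d).T


acts = np.zeros(pool5_shape(224, 224))
X = to_descriptors(acts)
print(acts.shape, X.shape)  # (7, 7, 512) (512, 49)
```

A 224 x 224 input therefore yields only 49 descriptors, which is worth keeping in mind when the number of Gaussian components is chosen later.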

These deep descriptors contain more spatial information than the
activations of the fully connected layers; e.g., the top-left cell's
d-dimensional deep descriptor is generated using only the top-left
part of the input image, ignoring other pixels. In addition, fully
connected layers have a large computational cost, because they contain
roughly 90% of all the parameters of the whole CNN model.

Table 2. Results of the different normalization methods

| Method          | Caltech101 | Stanford40 | Scene15 | Indoor67 |
|-----------------|------------|------------|---------|----------|
| No              | 90.63      | 74.84      | 90.75   | 71.20    |
| ℓ2 vector       | 92.02      | 73.41      | 90.92   | 74.03    |
| ℓ2 matrix       | 92.56      | 78.43      | 90.99   | 74.55    |
| PCA + ℓ2 matrix | 91.95      | 75.69      | 90.22   | 71.79    |

Thus, in DSP we use a fully convolutional network by 
removing the fully connected layers. 

3.2. Normalization and pooling of deep descriptors 

Let $X = [x_1, \ldots, x_t, \ldots, x_T] \in \mathbb{R}^{d \times T}$
be the matrix of d-dimensional deep descriptors extracted from an
image I via a pre-trained CNN model. X is usually processed by
dimensionality reduction methods such as PCA before its columns are
pooled into a single vector using VLAD or FV [7, 27]. PCA is usually
applied to SIFT features or fully connected layer activations, since
it has been empirically shown to improve the overall recognition
performance. However, our experiments show that PCA significantly
hurts recognition when applied to the fully convolutional activations.
Thus, it is not applied to fully convolutional deep descriptors in
this paper.

In addition, each deep descriptor $x_t$ inside X is not normalized in
current processing of deep visual descriptors [2]. We first try to
normalize $x_t$ with the ℓ2 vector normalization (i.e.,
$x_t \leftarrow x_t / \|x_t\|_2$), which leads to better results than
no normalization on most datasets, except on Stanford40, as shown in
Table 2.

We also propose a novel ℓ2 matrix normalization (i.e.,
$x_t \leftarrow x_t / \|X\|_2$), where $\|X\|_2$ is the matrix
spectral norm, i.e., the largest singular value of X. This
normalization has the benefit that it normalizes $x_t$ using
information from the entire image X. It is a bit surprising to observe
that it is more effective than the commonly used ℓ2 vector
normalization, sometimes by a large margin. An intuitive
interpretation is that the ℓ2 matrix normalization can use the global
information, making it more robust to changes such as illumination and
scale.
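The difference between the two normalizations can be sketched in a few lines of NumPy. This is a minimal sketch, assuming X is stored as a d x T array with one descriptor per column; the function names and the small `eps` guard are illustrative additions.

```python
import numpy as np


def l2_vector_normalize(X, eps=1e-12):
    """Divide every column x_t of X (d x T) by its own Euclidean norm."""
    norms = np.linalg.norm(X, axis=0, keepdims=True)
    return X / np.maximum(norms, eps)


def l2_matrix_normalize(X, eps=1e-12):
    """Divide all of X by its spectral norm (largest singular value)."""
    return X / max(np.linalg.norm(X, ord=2), eps)


rng = np.random.default_rng(0)
X = rng.random((512, 49))            # e.g. 49 pool5 descriptors of one image
Xv = l2_vector_normalize(X)
Xm = l2_matrix_normalize(X)
print(np.linalg.norm(Xv, axis=0)[:3])  # every column now has unit norm
print(np.linalg.norm(Xm, ord=2))       # spectral norm is 1 up to rounding
```

Note that the matrix normalization rescales all descriptors of an image by the same scalar, so the relative magnitudes of the descriptors within an image are preserved, whereas the vector normalization discards them.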

In order to evaluate the effect of these normalizations and of PCA on
classification performance, we use 4 datasets. We use the original
resolution of input images without cropping or warping and pool
activations by using FV with K = 4 (i.e., the GMM has 4 Gaussian
components). The experimental results are reported in Table 2. The ℓ2
matrix normalization before using FV is found to be important for
better performance.

The size of pool5 is not fixed, because input images have arbitrary
sizes. However, the classifiers (e.g., SVM or soft-max) require
fixed-length vectors. Thus, all the deep descriptors of an image must
be pooled to form a single vector. We use the Fisher Vector (FV) to
encode the deep descriptors.

We denote the parameters of the GMM with K components by
$\lambda = \{\omega_k, \mu_k, \sigma_k;\ k = 1, \ldots, K\}$, where
$\omega_k$, $\mu_k$ and $\sigma_k$ are the mixture weight, mean vector
and covariance matrix of the k-th Gaussian component, respectively.
The covariance matrices are diagonal and $\sigma_k^2$ are the variance
vectors. Let $\gamma_t(k)$ be the soft-assignment weight of $x_t$ with
respect to the k-th Gaussian; the FV representations corresponding to
$\mu_k$ and $\sigma_k$ are as follows [14]:
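$$ f_{\mu_k}(X) = \frac{1}{\sqrt{\omega_k}} \sum_{t=1}^{T} \gamma_t(k) \left( \frac{x_t - \mu_k}{\sigma_k} \right), \tag{1} $$

$$ f_{\sigma_k}(X) = \frac{1}{\sqrt{2\omega_k}} \sum_{t=1}^{T} \gamma_t(k) \left[ \frac{(x_t - \mu_k)^2}{\sigma_k^2} - 1 \right]. \tag{2} $$

Note that $f_{\mu_k}(X)$ and $f_{\sigma_k}(X)$ are both d-dimensional
vectors. The final Fisher Vector $f_\lambda(X)$ is the concatenation
of the gradients $f_{\mu_k}(X)$ and $f_{\sigma_k}(X)$ for all K
Gaussian components. Thus, FV can represent the set of deep
descriptors X with a 2dK-dimensional vector. In addition, the Fisher
Vector $f_\lambda(X)$ is improved by power normalization with a factor
of 0.5, followed by ℓ2 vector normalization [14].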
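The encoding can be sketched in NumPy as below. This is an illustrative sketch rather than the authors' implementation (the experiments rely on VLFeat): it assumes the common form of the improved FV gradients from [14], i.e. soft-assignment-weighted whitened residuals scaled by 1/sqrt(ω_k) for the mean part and 1/sqrt(2ω_k) for the variance part, and it takes `sigma` to be per-dimension standard deviations. The function name and argument layout are hypothetical.

```python
import numpy as np


def fisher_vector(X, w, mu, sigma):
    """Improved Fisher Vector of descriptors X (d x T) under a diagonal GMM.

    w: (K,) mixture weights, mu: (K, d) means, sigma: (K, d) standard
    deviations. Returns a power- and l2-normalized vector of length 2*d*K.
    """
    Xt = X.T                                              # (T, d)
    diff = (Xt[:, None, :] - mu[None]) / sigma[None]      # (T, K, d) whitened residuals
    # log(w_k * N(x_t; mu_k, sigma_k^2)) without the constant shared by all k
    log_p = (np.log(w)[None] - np.log(sigma).sum(axis=1)[None]
             - 0.5 * (diff ** 2).sum(axis=2))
    log_p -= log_p.max(axis=1, keepdims=True)             # numerical stability
    gamma = np.exp(log_p)
    gamma /= gamma.sum(axis=1, keepdims=True)             # (T, K) soft assignments
    f_mu = (gamma[:, :, None] * diff).sum(axis=0) / np.sqrt(w)[:, None]
    f_sigma = (gamma[:, :, None] * (diff ** 2 - 1)).sum(axis=0) / np.sqrt(2 * w)[:, None]
    # Order: [f_mu_1, f_sigma_1, ..., f_mu_K, f_sigma_K]
    fv = np.concatenate([f_mu, f_sigma], axis=1).ravel()
    fv = np.sign(fv) * np.sqrt(np.abs(fv))                # power normalization (0.5)
    return fv / max(np.linalg.norm(fv), 1e-12)            # l2 vector normalization


rng = np.random.default_rng(0)
d, T, K = 8, 49, 2
X = rng.standard_normal((d, T))
fv = fisher_vector(X, np.array([0.6, 0.4]), rng.standard_normal((K, d)), np.ones((K, d)))
print(fv.shape)  # (32,) = 2 * d * K
```

With d = 512 and K = 2, the same function would return a 2048-dimensional vector per spatial block.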

We will further study how to choose a proper K size for 
FV in Sec. 4. 


3.3. Deep spatial pyramid 

The proposed method is named DSP (Deep Spatial Pyramid), since adding
spatial pyramid information is the key part of DSP. Adding spatial
information through a spatial pyramid [12] has been shown to
significantly improve image recognition performance when dense SIFT
features are used. How can we efficiently and effectively utilize the
spatial information with fully convolutional activations?

The SPP-net method [9] adds a spatial pyramid pooling layer to deep
nets, which has improved recognition performance. However, since we
are using FV to pool activations from a fully convolutional network, a
more intuitive and natural way exists.

As previously discussed, one single cell (deep descriptor) in the last
convolutional layer corresponds to one local image patch in the input
image, and the set of all convolutional layer cells forms a regular
grid of image patches in the input image. This is a direct analogy to
the dense SIFT feature extraction framework. Instead of a regular grid
of SIFT vectors extracted from 16 x 16 local image patches, a grid of
deep descriptors is extracted from larger image patches by a CNN.

Thus, we can easily form a natural deep spatial pyramid by
partitioning an image into sub-regions and computing local features
inside each sub-region. In practice, we just need to spatially
partition the cells of activations in the last convolutional layer,
and then pool deep descriptors in each region separately using FV. The
operation of DSP is illustrated in Fig. 3.

Figure 3. Illustration of the level 1 and level 0 deep spatial pyramid.

Level 0 simply aggregates all cells using FV. Level 1, however, splits
the cells into 5 regions according to their spatial locations: the 4
quadrants and 1 centerpiece. Then, 5 FVs are generated from the
activations inside each spatial region. Note that the level 1 spatial
pyramid we use is different from the classic one in [12]. We follow Wu
and Rehg [22] and use an additional spatial region in the center of
the image. A DSP using two levels then concatenates all 6 FVs from
level 0 and level 1 to form the final image representation.
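The two-level partition can be sketched with boolean masks over the h x w grid of cells. This is a minimal sketch: the exact cell boundaries, in particular the extent of the center block (here roughly the central half in each direction), are assumptions made for illustration, and both function names are hypothetical.

```python
import numpy as np


def pyramid_regions(h, w):
    """Return 6 boolean masks over an h x w grid: level 0 (all cells),
    the 4 quadrants, and a center block (boundaries are an assumption)."""
    rows, cols = np.mgrid[0:h, 0:w]
    top, left = rows < h / 2, cols < w / 2
    center = ((np.abs(rows - (h - 1) / 2) <= h / 4)
              & (np.abs(cols - (w - 1) / 2) <= w / 4))
    return [np.ones((h, w), bool),             # level 0: every cell
            top & left, top & ~left,           # level 1: four quadrants
            ~top & left, ~top & ~left,
            center]                            # level 1: center block


def split_descriptors(activations):
    """Split an (h, w, d) tensor into 6 descriptor matrices of shape (d, T_i)."""
    h, w, _ = activations.shape
    return [activations[mask].T for mask in pyramid_regions(h, w)]


blocks = split_descriptors(np.zeros((7, 7, 512)))
print([b.shape[1] for b in blocks])  # [49, 16, 12, 12, 9, 9]
```

Each of the 6 matrices would then be encoded by its own FV, using the same GMM.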

The proposed DSP method is summarized in Algorithm 1.


Algorithm 1 The DSP pipeline

1: Input:
2:   An input image I
3:   A pre-trained CNN model
4: Procedure:
5:   Extract deep descriptors X from I using the pre-defined model,
     $X = [x_1, \ldots, x_t, \ldots, x_T]$
6:   For each activation vector $x_t$, perform ℓ2 matrix
     normalization $x_t \leftarrow x_t / \|X\|_2$
7:   (Estimate a GMM $\lambda = \{\omega_k, \mu_k, \sigma_k\}$ using
     the training set);
8:   Generate a spatial pyramid $\{X_1, \ldots, X_m\}$ for X
9:   for all $1 \le i \le m$
10:    $f_\lambda(X_i) \leftarrow [f_{\mu_1}(X_i), f_{\sigma_1}(X_i), \ldots, f_{\mu_K}(X_i), f_{\sigma_K}(X_i)]$
11:    $f_\lambda(X_i) \leftarrow \mathrm{sign}(f_\lambda(X_i)) \sqrt{|f_\lambda(X_i)|}$
12:    $f_\lambda(X_i) \leftarrow f_\lambda(X_i) / \|f_\lambda(X_i)\|_2$
13:  end for
14:  Concatenate $f_\lambda(X_i)$, $1 \le i \le m$, to form the final
     spatial pyramid representation $f(X)$
15:  $f(X) \leftarrow f(X) / \|f(X)\|_2$
16: Output: $f(X)$.
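The normalization and concatenation steps at the end of the algorithm (steps 11 to 15) can be sketched as follows, assuming the raw block FVs have already been computed. The function names are hypothetical, and random vectors stand in for real FVs.

```python
import numpy as np


def improved_normalize(f):
    """Steps 11-12: signed square root, then l2 normalization."""
    f = np.sign(f) * np.sqrt(np.abs(f))
    return f / max(np.linalg.norm(f), 1e-12)


def dsp_representation(block_fvs):
    """Normalize each block FV, concatenate them, and l2-normalize the result."""
    parts = [improved_normalize(f) for f in block_fvs]
    f = np.concatenate(parts)
    return f / max(np.linalg.norm(f), 1e-12)


rng = np.random.default_rng(0)
raw = [rng.standard_normal(2 * 512 * 2) for _ in range(6)]  # m = 6, d = 512, K = 2
f = dsp_representation(raw)
print(f.shape)  # (12288,)
```

Because every block FV has unit norm before concatenation, each of the m blocks contributes equally to the final unit-norm vector.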


3.4. Multi-scale DSP 

In order to capture variations of the activations caused by variations
of objects in an image, we generate a multiple scale pyramid,
extracted from S different rescaled versions of the original input
image. We feed images of all different scales into a pre-trained CNN
model and extract deep activations. In each scale, the corresponding
rescaled image is encoded into a 2mdK-dimensional vector by DSP.
Therefore, we have S vectors of 2mdK dimensions, and they are merged
into a single vector by average pooling, as

$$ f_m = \frac{1}{S} \sum_{s=1}^{S} f_s, \tag{3} $$

where $f_s$ is the DSP representation extracted from the scale level
s. Finally, ℓ2 normalization is applied to $f_m$. Note that each
vector $f_s$ is already ℓ2 normalized, as shown in Algorithm 1.
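The average pooling over scales followed by ℓ2 normalization can be sketched in a few lines. The function name is hypothetical, and two toy unit vectors stand in for per-scale DSP representations.

```python
import numpy as np


def multiscale_dsp(per_scale_vectors):
    """Average S per-scale DSP vectors and l2-normalize the result."""
    f_m = np.mean(np.stack(per_scale_vectors), axis=0)
    return f_m / max(np.linalg.norm(f_m), 1e-12)


f1 = np.array([1.0, 0.0])   # stand-ins for two l2-normalized DSP vectors
f2 = np.array([0.0, 1.0])
print(multiscale_dsp([f1, f2]))  # both entries equal 1/sqrt(2)
```

The merged vector has the same length as a single-scale DSP vector, so using more scales increases extraction time but not storage.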

The multi-scale DSP is related to MPP proposed by Yoo et al. [27]. A
key difference between our method and MPP is that $f_s$ encodes
spatial information while MPP does not.

4. A small K is better in FV in DSP 

In this section, we discuss one key characteristic of DSP, i.e., the
number of GMM components.

Our experiments show that in DSP, a small number of GMM components K
(e.g., from 1 to 4) achieves satisfactory classification performance.
In fact, when different K values are used, the highest recognition
accuracy is usually achieved by setting K to 1 or 2!

This phenomenon is not consistent with common practices in image
classification using local descriptors with the FV encoding. When deep
learning features are used together with FV, a large K value is also
used. For instance, Yoo et al. [27] specified the value of K to be 256
when they trained their visual vocabulary. More examples of large K
values can be found in Table 1. Having a small K value is very
beneficial in terms of CPU and storage costs; however, why does DSP
require a small K?

We believe the answer is that DSP uses a small number of deep
descriptors per image, i.e., h x w is a small integer. We usually
extract no more than 100 512-dimensional deep descriptors from the
last convolutional layer for one image, while [27] represented one
image with 4,410 vectors of 4,096-dimensional dense CNN activations.
If the value of K is specified as a large number (e.g., 128 or 256),
the resulting FV representation will be problematic.

First, if a large K is used in DSP, there will not be enough deep
descriptors to estimate an accurate GMM, because each training image
only contributes a small number of deep descriptors. An inaccurate GMM
will seriously hurt classification performance. Second, many FV
components will only contain zeros, because there are more Gaussian
components than CNN descriptors. We conjecture that this will cause FV
to lose accuracy.
Figure 4. Plot of ω values in DSP. For each of the seven datasets used
in our experiments, we vary the number of Gaussian components K to be
64 or 256. (a) and (b) are plots for the Caltech-101 dataset, with K
being 64 and 256, respectively. The meaning of the other plots can be
deduced from their captions similarly. Note that the plots for Scene15
are not similar to the other plots. When K is larger than 4, DSP could
achieve satisfactory classification accuracy rates on Scene15, a trend
that is consistent with the plots shown in (g) and (h).

Figure 5. Classification performance of DSP and Ms-DSP with different
numbers of Gaussians. (a) Caltech-101 and Scene15. (b) Stanford40 and
Indoor67.

We also empirically study this phenomenon. In Fig. 4, we plot the
distribution of the GMM components' priors (i.e., ω) in DSP. There are
14 plots for the 7 datasets used in our experiments. Two plots are
shown for each dataset, corresponding to different numbers of GMM
components (shown on the horizontal axis), i.e., 64 and 256. The
vertical axis shows the value of ω for each Gaussian component.


It is easy to see that, for most datasets, one or two ω values are
much larger than the rest. For example, when K = 64 on the SUN397
dataset, the two tall bars indicate that two ω values are above 0.3,
and their sum is around 0.7. In other words, only 2 Gaussian
components are responsible for about 70% of the distribution. The
remaining 30% might be related to noisy or background image patches.
Thus, K = 2 might be the best choice in this particular case. On most
datasets, we can observe the same phenomenon: one or two Gaussian
components dominate the entire distribution. This observation might
explain why DSP just needs a small number of Gaussian components.
Since a small value of K in DSP leads to a much lower computational
cost, DSP can efficiently handle large scale image classification
tasks.

We further evaluate the impact of K in DSP and multiple scale DSP
(Ms-DSP). We show the classification results in Fig. 5 as a function
of the number of Gaussians (i.e., K) of the GMM, where K is increased
by a factor of 2. A smaller K (e.g., K = 2) always obtains better
classification performance for DSP and Ms-DSP. As K increases, the
discriminative ability of DSP and Ms-DSP drops. The DSP or Ms-DSP
feature vector may be too sparse when K is increased, which is
detrimental to classification. When K = 2, a DSP representation has
only 2 x 512 x 2 x 6 = 12288 dimensions. The entire DSP pipeline (from
reading in an image till emitting a prediction) requires on average
0.15 second per image.
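The dimension count generalizes to 2 x d x K x m, which makes the storage gap between small and large K easy to quantify. The sketch below assumes 4-byte (single precision) values, which the text does not specify; both function names are hypothetical.

```python
def dsp_dim(d, K, m):
    """Length of the DSP vector: 2 (mean and variance parts) * d * K * m."""
    return 2 * d * K * m


def storage_bytes(dim, bytes_per_value=4):
    """Storage of one representation, assuming 4-byte floats."""
    return dim * bytes_per_value


for K in (2, 256):
    dim = dsp_dim(512, K, 6)
    print(K, dim, storage_bytes(dim) / 1024, "KiB")
# 2 12288 48.0 KiB
# 256 1572864 6144.0 KiB
```

Under these assumptions, K = 256 would need 128 times more storage per image than K = 2 for the same pyramid.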

For a fixed K, Ms-DSP always significantly outperforms 
DSP. This is not surprising since, for a given K, Ms-DSP 
captures more information from rescaled images, which 
DSP does not have access to. 

5. Experiments 

The purpose of this section is to evaluate the performance of DSP as a
complete pipeline. We report results on three object recognition
datasets, Caltech-101 [5], Caltech-256 [8] and Pascal VOC 2007 [3],
three scene recognition datasets, Scene15 [12], MIT Indoor67 [15] and
SUN397 [16], and one action recognition dataset, Stanford40 [26].
Except for Pascal VOC 2007 and MIT Indoor67, which have fixed training
and test splits, results on the other datasets are averaged over three
randomly sampled train/test splits.

5.1. Datasets 

Caltech-101 [5] contains 9K labeled images of 101 object categories
and a background category. We follow the procedure of [5] and randomly
select 30 images per category for training and test on up to 50 images
per class in every split. Caltech-256 [8], with 31K images and 257
classes, is an improvement of Caltech-101. Following [8], each split
contains 60 training images per class and the rest are used for
testing. For PASCAL VOC 2007, which contains 20 object classes, we use
its standard protocol, measure the average precision (AP), and report
the mean AP (mAP) over the 20 categories.

Scene15 is composed of 15 different kinds of scenes, where each
category has 200 to 400 images. We randomly select 100 images per
class for training and use the rest for testing, following [30]. MIT
Indoor67 [15] is a challenging indoor scene dataset; indoor scenes are
harder to recognize than outdoor ones. The dataset has 15,620 images
in 67 indoor scene categories. The standard split [15] for this
dataset consists of 80 training and 20 test images per category.
SUN397 [24] is the largest dataset for scene recognition. It contains
397 categories and each category has at least 100 images. The training
and test splits are fixed and publicly available from [24], where each
split has 50 training and 50 test images per category. We use the
first three of the 10 public splits in our experiments.

Stanford40 [26] contains 40 diverse daily human actions, with 180-300
images for each category. In each split, we randomly select 100 images
in each class for training and use the remaining images for testing.

In our experiments, the average accuracy rate is used to evaluate the
classification performance on Caltech-101, Caltech-256, MIT Indoor67,
Scene15, SUN397, and Stanford40. For PASCAL VOC 2007, we employ mean
average precision (mAP) to evaluate our proposed method and other
approaches.

5.2. Experiment details 

In our DSP, VGG Net-D [17] is employed as the pre-trained CNN model to
extract deep activations. For simplicity, the pre-trained CNN model
weights are kept fixed without fine-tuning. Note that we employ VGG
Net-D without its fully connected layers in our experiments, so it can
accept input images of arbitrary sizes. Input images do not need to be
resized to a fixed aspect ratio. However, for running efficiency, an
image is resized such that its smallest edge is not shorter than 224
and its largest edge is not longer than 1120. In addition, each image
is preprocessed by subtracting the per-pixel mean (computed on the
ImageNet images and provided along with the CNN model).
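One possible reading of the resizing rule is sketched below. The order in which the two limits are applied is an assumption, since the text does not specify it, and the function name is hypothetical. For extreme aspect ratios (above 1120 / 224 = 5) both limits cannot hold at once, and this sketch lets the upper limit win.

```python
def dsp_input_size(height, width, min_edge=224, max_edge=1120):
    """Return a resized (height, width) with unchanged aspect ratio such that
    the shorter edge is >= min_edge and the longer edge is <= max_edge
    whenever both are possible (assumed order: enlarge first, then shrink)."""
    scale = 1.0
    if min(height, width) < min_edge:
        scale = min_edge / min(height, width)      # enlarge small images
    if max(height, width) * scale > max_edge:
        scale = max_edge / max(height, width)      # shrink large images
    return round(height * scale), round(width * scale)


print(dsp_input_size(500, 375))    # (500, 375): already within both limits
print(dsp_input_size(100, 200))    # (224, 448)
print(dsp_input_size(2240, 1120))  # (1120, 560)
```

Images that already satisfy both limits are left at their original resolution, which matches the statement that no fixed aspect ratio is imposed.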

We use K = 2 in FV in this section. An image is represented by the
concatenation of FVs from all 6 sub-blocks in a two-level deep spatial
pyramid. For multi-scale DSP, the rescaled images are s times the size
of the original input image, where s ∈ {1.4, 1.2, 1.0, 0.8, 0.6}. The
FVs of all five scales are merged into a single vector by average
pooling as in Eq. 3.

One-versus-rest linear SVM is used for classification. Following [30],
all classifiers use the same parameter C = 1 for fair comparisons. Our
experiments use the following open source libraries: VLFeat [20],
MatConvNet [21] and LIBLINEAR [4].

5.3. Main results 

State-of-the-art and two baseline results are reported in 
Table 3. In particular, the first baseline method is fcg which 
is extracted from the last fully connected layer. To extract 
fcg feature, we resize the image so that its resolution is 


Table 3. Recognition accuracy (or mAP) comparisons on seven datasets. The highest accuracy (mAP) of each column is marked in bold.
The results of [17] were achieved using VGG Net-D and VGG Net-E; its Caltech-101 and Caltech-256 results (*) were measured by mean class recall
instead of accuracy.

| Methods  | Description | Caltech-101    | Caltech-256    | VOC 2007 | Scene15    | SUN397     | MIT Indoor67 | Stanford40 |
|----------|-------------|----------------|----------------|----------|------------|------------|--------------|------------|
| SoA      | [9]         | 93.42±0.50     | -              | 82.44    | -          | -          | -            | -          |
| SoA      | [ ]         | -              | -              | -        | -          | 51.98      | 68.88        | -          |
| SoA      | [27]        | -              | -              | 82.13    | -          | -          | 77.56        | -          |
| SoA      | [30]        | 84.79±0.66     | 65.06±0.25     | -        | 91.59±0.48 | 53.86±0.21 | 70.80        | 55.28±0.64 |
| SoA      | [1]         | 88.35±0.56     | 77.61±0.12     | 82.4     | -          | -          | -            | -          |
| SoA      | [17]        | 92.7±0.5 (*)   | 86.2±0.3 (*)   | 89.7     | -          | -          | -            | -          |
| Baseline | fc8         | 90.55±0.31     | 82.02±0.12     | 84.61    | 89.88±0.76 | 53.90±0.45 | 69.78        | 71.53±0.34 |
| Baseline | pool5+FV    | 90.03±0.75     | 79.48±0.53     | 88.12    | 89.00±0.42 | 51.39±0.51 | 71.57        | 73.96±0.52 |
| Our      | DSP         | 94.66±0.26     | 84.22±0.11     | 88.60    | 91.13±0.77 | 57.27±0.34 | 76.34        | 79.75±0.34 |
| Our      | Ms-DSP      | 95.11±0.26     | 85.47±0.14     | 89.31    | 91.78±0.22 | 59.78±0.47 | 78.28        | 80.81±0.29 |

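Table 4. Per-class classification performance on PASCAL VOC 2007.

| Methods  | Description | aero  | bike  | bird  | boat  | bottle | bus   | car   | cat   | chair | cow   | table | dog   | horse | mbike | person | plant | sheep | sofa  | train | tv    |
|----------|-------------|-------|-------|-------|-------|--------|-------|-------|-------|-------|-------|-------|-------|-------|-------|--------|-------|-------|-------|-------|-------|
| Baseline | fc8         | 96.27 | 90.81 | 93.81 | 92.40 | 58.24  | 86.01 | 90.92 | 91.91 | 69.45 | 78.08 | 79.36 | 90.87 | 91.69 | 88.98 | 95.35  | 61.31 | 88.14 | 71.68 | 96.53 | 80.28 |
| Baseline | pool5+FV    | 97.23 | 94.44 | 96.12 | 93.54 | 70.99  | 88.45 | 93.43 | 95.48 | 71.16 | 81.33 | 82.21 | 93.55 | 95.08 | 90.51 | 97.64  | 69.84 | 88.70 | 77.42 | 96.92 | 88.29 |
| Our      | DSP         | 97.45 | 94.12 | 96.79 | 94.98 | 69.64  | 87.99 | 93.28 | 95.76 | 72.75 | 81.65 | 85.07 | 94.31 | 94.84 | 91.57 | 97.53  | 69.61 | 89.42 | 80.14 | 97.47 | 87.64 |
| Our      | Ms-DSP      | 97.67 | 95.24 | 96.84 | 94.47 | 70.58  | 89.32 | 93.50 | 95.92 | 74.61 | 83.99 | 85.68 | 95.27 | 95.37 | 92.02 | 97.42  | 71.05 | 90.82 | 80.57 | 97.69 | 88.14 |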


224 × 224. ℓ2 normalization is applied to the fc8 activations
before training the SVM, as suggested in [1]. The other
baseline is pool5+FV, where deep descriptors are aggregated
into a single vector by orderless FV pooling. For a fair
comparison, we use the same input image resolution as in
our DSP.
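
For concreteness, ℓ2 normalization of a single feature vector (as
applied to the fc8 activations) can be sketched as follows. The toy
vector and the small constant `eps`, which guards against division by
zero, are assumptions made for this illustration only.

```python
import math
from typing import List, Sequence


def l2_normalize(x: Sequence[float], eps: float = 1e-12) -> List[float]:
    """Scale x to unit Euclidean norm; eps avoids dividing by zero."""
    norm = math.sqrt(sum(v * v for v in x))
    return [v / max(norm, eps) for v in x]


fc8_toy = [3.0, 4.0, 0.0]     # stand-in for a real fc8 activation vector
print(l2_normalize(fc8_toy))  # [0.6, 0.8, 0.0]
```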

On most datasets, fc8 already performs well. The pool5+FV
baseline produces quite good results even though the pool5
activations are computed using only 10% of the parameters of
the complete CNN model. This shows that fully convolutional
features (with a small K in FV and ℓ2 matrix normalization)
are powerful, especially on VOC2007 (84.61% → 88.12%)
and Stanford40 (71.53% → 73.96%).

DSP and multi-scale DSP significantly outperform the
baselines and state-of-the-art methods. Compared to the
baselines, DSP improves performance on all datasets by
1-5%, especially on SUN397 (53.90% → 57.27%) and
Stanford40 (73.96% → 79.75%). This gain is mainly due to the
fact that DSP can capture the spatial information on top of
the pool5 activations. On the other hand, the fully
convolutional network relaxes the constraint that the input
images must have the same fixed size, thus the full image can
be fed into a pre-trained CNN without changing its aspect
ratio. Combining multiple scales and DSP (Ms-DSP) achieves
the best recognition performance on all datasets. Since fully
convolutional features and a small K are used, Ms-DSP is
still very efficient.

Our DSP and Ms-DSP achieve mean class recall of 96.38 ±
0.53 and 96.88 ± 0.59 on Caltech-101, respectively, and
90.05 ± 0.07 and 90.89 ± 0.17 on Caltech-256, respectively.
These results are significantly higher than those of [17]
(92.7% for Caltech-101 and 86.2% for Caltech-256).
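
Mean class recall and accuracy differ when classes have unequal
numbers of test images, which is why the two measures are reported
separately. The sketch below contrasts them on a made-up toy labeling
(not taken from any of our datasets); the function names are
hypothetical.

```python
from collections import defaultdict
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of all test samples that are classified correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_class_recall(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Average over classes of the per-class fraction of correct samples."""
    total = defaultdict(int)
    correct = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += int(t == p)
    return sum(correct[c] / total[c] for c in total) / len(total)


# Toy example: class 0 has four samples, class 1 has two.
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 1, 0]

print(round(accuracy(y_true, y_pred), 4))  # 0.8333
print(mean_class_recall(y_true, y_pred))   # 0.75
```

In this toy case the single error falls in the smaller class, so it
costs more under mean class recall than under accuracy.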

In addition, on the VOC2007 dataset, our best performance
is slightly lower (0.4%) than that in [17]. However,
[17] used a fusion feature which was computed using two
pre-trained CNNs (i.e., VGG Net-D and VGG Net-E). Detailed
VOC results in Table 4 show that our methods are better
than fc8 in every category.
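
The per-category claim can be checked directly from the numbers in
Table 4. The short script below copies the four rows of Table 4,
recomputes the mean over the 20 categories, and tests whether every
method built on convolutional activations exceeds fc8 in each
category. It is a convenience check written for this purpose, not
part of the experimental pipeline.

```python
CLASSES = ("aero bike bird boat bottle bus car cat chair cow table dog "
           "horse mbike person plant sheep sofa train tv").split()

# Per-class AP values copied from Table 4.
AP = {
    "fc8": [96.27, 90.81, 93.81, 92.40, 58.24, 86.01, 90.92, 91.91, 69.45,
            78.08, 79.36, 90.87, 91.69, 88.98, 95.35, 61.31, 88.14, 71.68,
            96.53, 80.28],
    "pool5+FV": [97.23, 94.44, 96.12, 93.54, 70.99, 88.45, 93.43, 95.48,
                 71.16, 81.33, 82.21, 93.55, 95.08, 90.51, 97.64, 69.84,
                 88.70, 77.42, 96.92, 88.29],
    "DSP": [97.45, 94.12, 96.79, 94.98, 69.64, 87.99, 93.28, 95.76, 72.75,
            81.65, 85.07, 94.31, 94.84, 91.57, 97.53, 69.61, 89.42, 80.14,
            97.47, 87.64],
    "Ms-DSP": [97.67, 95.24, 96.84, 94.47, 70.58, 89.32, 93.50, 95.92,
               74.61, 83.99, 85.68, 95.27, 95.37, 92.02, 97.42, 71.05,
               90.82, 80.57, 97.69, 88.14],
}


def mean_ap(values):
    """Mean of the per-class AP values."""
    return sum(values) / len(values)


for name, values in AP.items():
    print(f"{name}: mAP = {mean_ap(values):.2f}")

# True if each non-fc8 row is strictly above fc8 in all 20 categories.
beats_fc8 = all(
    all(v > base for v, base in zip(AP[name], AP["fc8"]))
    for name in AP if name != "fc8"
)
print(beats_fc8)  # True
```

The recomputed means agree with the VOC 2007 column of Table 3 to
within 0.01, the small differences being due to rounding of the
per-class values.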

6. Conclusion 

In order to present a powerful deep feature representation,
the details have to be right. In other words, decisions
on important factors must be carefully studied and
made. In this paper, we picked a list of 5 important factors
and provided our answers to them. The main findings of
this paper form a complete pipeline, DSP (deep spatial
pyramid), which integrates the following components:
activations from the last convolutional layer, natural
processing of input images of any size instead of a fixed
size, dense deep features extracted from multiple scales,
and, most importantly, a natural way to build a spatial
pyramid in deep learning. DSP, in spite of being simple and
efficient, has excellent performance on many benchmark
datasets.

In particular, we emphasize the following new findings.

• Normalization: ℓ2 matrix normalization is more effective
than no normalization or ℓ2 vector normalization.

• DSP: DSP can effectively capture the spatial information
in a natural and efficient manner.

• K size in FV: Pooling deep descriptors only needs a small
number of Gaussian components in the Fisher Vector,
which leads to lower computational costs.
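
To make the cost argument of the last finding concrete, the sketch
below computes the length of the image representation as a function
of K. It assumes the common Fisher Vector form that stacks gradients
with respect to the means and variances of a diagonal GMM, giving
2·K·D values per encoding, and a pyramid of 6 sub-blocks. The
descriptor dimension D = 512 and the "large" value K = 256 are
illustrative assumptions, not values stated in this section.

```python
def fv_dim(num_gaussians: int, descriptor_dim: int) -> int:
    """Length of one Fisher Vector: mean and variance gradients per Gaussian."""
    return 2 * num_gaussians * descriptor_dim


def dsp_dim(num_gaussians: int, descriptor_dim: int, num_blocks: int = 6) -> int:
    """Length of the concatenation of one Fisher Vector per sub-block."""
    return num_blocks * fv_dim(num_gaussians, descriptor_dim)


D = 512  # assumed dimension of one deep descriptor
for K in (2, 256):
    print(K, fv_dim(K, D), dsp_dim(K, D))
# 2 2048 12288
# 256 262144 1572864
```

Under these assumptions, going from K = 256 to K = 2 shortens the
vector passed to the linear SVM by a factor of 128.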

Other factors and details can be further considered in
the DSP framework, which we will study in the future.
For example, convolutional activations from multiple
layers (cross-layer [13]) might further improve
classification accuracy, and VLAD might be a better fit than
FV for aggregating deep convolutional activations [25].